Abstract

When comparing inductive logic programming (ILP) and attribute- 
value learning techniques, there is a trade-off between expressive power 
and efficiency. Inductive logic programming techniques are typically 
more expressive but also less efficient. Therefore, the data sets handled 
by current inductive logic programming systems are small according 
to general standards within the data mining community. The main 
source of inefficiency lies in the assumption that several examples may 
be related to each other, so they cannot be handled independently. 

Within the learning from interpretations framework for inductive
logic programming this assumption is unnecessary, which makes it possible to
scale up existing ILP algorithms. In this paper we explain this learning setting
in the context of relational databases. We relate the setting to propositional
data mining and to the classical ILP setting, and show that
learning from interpretations corresponds to learning from multiple relations
and thus extends the expressiveness of propositional learning,
while maintaining its efficiency to a large extent (which is not the case
in the classical ILP setting).

As a case study, we present two alternative implementations of the
ILP system Tilde (Top-down Induction of Logical DEcision trees): Tildeclassic,
which loads all data in main memory, and TildeLDS, which loads the
examples one by one. We experimentally compare the implementations,
showing that TildeLDS can handle large data sets (in the order of 100,000
examples or 100 MB) and indeed scales up linearly in the number of
examples.

Keywords : Inductive logic programming, machine learning, data mining. 

1 Introduction



There is a general trade-off in computer science between expressive power and efficiency.
Theorem proving in first order logic is less efficient but more expressive than
theorem proving in propositional logic. It is therefore no surprise that first order 
induction techniques (such as those studied within inductive logic programming) 
are less efficient than propositional or attribute-value learning techniques. On the 
other hand, inductive logic programming is able to solve induction problems beyond 
the scope of attribute value learning, cf. (Bratko and Muggleton, 1995). 

The computational requirements of inductive logic programming systems are
higher than those of propositional learners for the following reasons. First, the
space of clauses considered by inductive logic programming systems typically is 
much larger than that of propositional learners and can even be infinite. Second, 
testing whether a clause covers an example is more complex than in attribute value 
learners. In attribute value learners an example corresponds to a single tuple in 
a relational database, whereas in inductive logic programming one example may 
correspond to multiple tuples of multiple relations. Therefore, the coverage test in 
inductive logic programming needs a database system to solve complex queries or 
even a theorem prover. Third, and this is related to the second point, in attribute 
value learning testing whether an example is covered is done locally, i.e. indepen- 
dently of the other examples. Therefore, even if the data set is huge, a specific 
coverage test can be performed efficiently. This contrasts with the large majority 
of inductive logic programming systems, such as FOIL (Quinlan, 1990) or Progol 
(Muggleton, 1995), in which coverage is tested globally, i.e. to test the coverage of 
one example the whole ensemble of examples and background theory needs to be 
considered¹. Global coverage tests are much more expensive than local ones. Moreover,
systems using global coverage tests are hard to scale up. Because
one single coverage test (on one example) typically takes more than constant time 
in the size of the database, the complexity of induction systems exploiting global 
coverage tests will grow more than linearly in the number of examples. 

In a more recent setting for inductive logic programming, called learning from 
interpretations (De Raedt and Dzeroski, 1994; De Raedt et al., 1998), it is assumed 
that each example is a small database (or a part of a global database), and local 
coverage tests are performed. Algorithms using local coverage tests are typically 
linear in the number of examples. Furthermore, as each example can be loaded 
independently of the other ones, there is no need to use a database system even 
when the whole data set cannot be loaded into main memory. 

Within the setting of learning from interpretations, we investigate the issue 
of scaling up inductive logic programming. More specifically, we present two
alternative implementations of the Tilde system (Blockeel and De Raedt, 1998):
Tildeclassic, which loads all data in main memory, and TildeLDS, which loads
the examples one by one. The latter is inspired by the work of Mehta et al. (1996),
who propose a level-wise algorithm that needs one pass through the data per level of
the tree it builds. Furthermore, we experimentally compare the algorithms on large 
data sets involving 100,000 examples (in the order of 100 MBytes). The experiments 
clearly show that inductive logic programming systems can be scaled up to satisfy 
the standards imposed by the data mining community. At the same time, this pro- 
vides evidence in favor of local coverage tests (as in learning from interpretations) 
in inductive logic programming. 

This article is organized as follows. In Section 2 we introduce the learning from 
interpretations setting and relate it to the relational database context. In Section 
3 we introduce first order logical decision trees and discuss the ILP system Tilde,
which induces such trees. Section 4 shows how many propositional techniques can
be upgraded to the learning from interpretations setting (using Tilde as an illustration),
and discusses why this is much harder for the classical ILP setting. Section
5 reports on experiments with Tilde through which we empirically validate our
claims, Section 6 discusses some related work, and in Section 7 we conclude.

¹ E.g., testing the coverage of member(a, [b,a]) may depend on member(a, [a]).

2 The learning setting 

We first introduce the problem specification in a logical context, then discuss it in 
the context of relational databases, and finally relate it to the standard inductive 
logic programming setting. 

We assume familiarity with Prolog or Datalog (see e.g. (Bratko, 1990)), and 
relational databases (see e.g. (Elmasri and Navathe, 1989)). 

A word on our notation: in logical formulae we will adopt the Prolog convention 
that names starting with a capital denote variables, and names starting with a 
lowercase character denote constants. 

2.1 Problem specification 

In our framework, each example is a set of facts. These facts encode the specific 
properties of the examples in a database. Furthermore, each example is classi- 
fied into one of a finite set of possible classes. One may also specify background 
knowledge in the form of a Prolog program. 
More formally, the problem specification is: 

Given: 

• a set of classes $C$ (each class label $c$ is a nullary predicate),

• a set of classified examples $E$ (each element of $E$ is of the form $(e, c)$ with $e$
a set of facts and $c$ a class label),

• and a background theory $B$,

Find: a hypothesis $H$ (a Prolog program), such that for all $(e, c) \in E$,

• $H \land e \land B \models c$, and

• $\forall c' \in C - \{c\} : H \land e \land B \not\models c'$

This setting is known in inductive logic programming under the label learning 
from interpretations (De Raedt and Dzeroski, 1994; De Raedt, 1997; De Raedt et 
al., 1998) (an interpretation is just a set of facts). Notice that within this setting, 
one always learns first order definitions of propositional predicates (the classes) . An 
implicit assumption is that the class of an example depends on that example only, 
not on any other examples. This is a reasonable assumption for many classification 
problems, though not for all; it precludes, e.g., recursive concept definitions. 

Example 1 Figure 1 shows a set of pictures, each of which is labelled ⊕ or ⊖. The
task is to classify new pictures into one of these classes by looking at the objects
in the pictures. We call problems of this kind Bongard problems, after Mikhail
Bongard, who used similar problems for pattern recognition tests (Bongard, 1970).

Assuming we only consider the shape, configuration (pointing upwards or downwards,
for triangles only) and relative position (objects may be inside other objects)
of objects, the pictures in Figure 1 can be represented as follows:
Figure 1: Bongard problems



Picture 1: {circle(o1), triangle(o2), points(o2, up), inside(o2, o1)}
Picture 2: {circle(o3), triangle(o4), points(o4, up), triangle(o5),
            points(o5, down), inside(o4, o5)}

etc. 

(The $o_i$ are constants denoting geometric objects. The exact names of these constants
are of no importance; they will not be referred to in the first order hypothesis.)

Background knowledge might be provided to the learner, e.g., the following definitions
could be in the background:

doubletriangle(O1,O2) :- triangle(O1), triangle(O2), O1 \= O2.
polygon(O) :- triangle(O). 
polygon(O) :- square(O). 

When considering a particular example (e.g. Picture 2) in conjunction with the 
background knowledge it is possible to deduce additional facts in the example. For 
instance, in Picture 2, the facts doubletriangle(o4,o5) and polygon(o4) hold.
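
The deduction of additional facts can be illustrated with a small Python sketch. It is not part of Tilde: the function name derive and the encoding of facts as tuples are assumptions made for illustration, and the two background predicates are hard-coded rather than interpreted from a Prolog program.

```python
def derive(example):
    """Extend one interpretation with the facts implied by the background
    definitions of doubletriangle/2 and polygon/1 (hard-coded here)."""
    facts = set(example)
    triangles = {f[1] for f in facts if f[0] == "triangle"}
    squares = {f[1] for f in facts if f[0] == "square"}
    # doubletriangle(O1,O2) :- triangle(O1), triangle(O2), O1 \= O2.
    for a in triangles:
        for b in triangles:
            if a != b:
                facts.add(("doubletriangle", a, b))
    # polygon(O) :- triangle(O).   polygon(O) :- square(O).
    for obj in triangles | squares:
        facts.add(("polygon", obj))
    return facts


# Picture 2 from Example 1, encoded as a set of tuples.
picture2 = {
    ("circle", "o3"), ("triangle", "o4"), ("points", "o4", "up"),
    ("triangle", "o5"), ("points", "o5", "down"), ("inside", "o4", "o5"),
}
extended = derive(picture2)
```

Because only the facts of one picture are passed in, the derivation is local to that example, which is the property the learning from interpretations setting relies on.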

The format of a hypothesis in this setting will be illustrated later.

2.2 Learning from Multiple Relations

The learning from interpretations setting, as introduced before, can easily be related 
to learning from multiple relations in a relational database. 

Typically, each predicate will correspond to one relation in the relational database. 
Each fact in an interpretation is a tuple in the database, and an interpretation cor- 
responds to a part of the database (a set of tuples). Background knowledge can be 
expressed by means of views as well as extensional tables. 

Example 2 For the Bongard example, the following database contains a description 
of the first two pictures in Figure 1 (note that an extra relation CONTAINS is
introduced, linking objects to pictures; this relation was implicit in the previous 
representation) : 

CONTAINS
picture  object
1        o1
1        o2
2        o3
2        o4
2        o5

CIRCLE    TRIANGLE    POINTS               INSIDE
object    object      object  direction    inner  outer
o1        o2          o2      up           o2     o1
o3        o4          o4      up           o4     o5
          o5          o5      down


The background knowledge can be defined using views, as follows (we are assuming
here that a relation SQUARE is also defined):

CREATE VIEW doubletriangle AS
SELECT c1.object AS object1, c2.object AS object2
FROM contains c1, contains c2
WHERE c1.object <> c2.object
  AND c1.picture = c2.picture
  AND c1.object IN (SELECT object FROM triangle)
  AND c2.object IN (SELECT object FROM triangle);

CREATE VIEW polygon AS
SELECT object FROM triangle
UNION
SELECT object FROM square;
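
As a rough check of these view definitions, the following sketch loads the Bongard relations into an in-memory SQLite database through Python's sqlite3 module. The choice of SQLite, the column types, and the spelling of the views with CREATE VIEW and subqueries are assumptions made so that the statements run; they are not prescribed by the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contains (picture INTEGER, object TEXT);
CREATE TABLE triangle (object TEXT);
CREATE TABLE square   (object TEXT);

INSERT INTO contains VALUES (1,'o1'), (1,'o2'), (2,'o3'), (2,'o4'), (2,'o5');
INSERT INTO triangle VALUES ('o2'), ('o4'), ('o5');

-- Pairs of distinct triangles that occur in the same picture.
CREATE VIEW doubletriangle AS
SELECT c1.object AS object1, c2.object AS object2
FROM contains c1, contains c2
WHERE c1.object <> c2.object
  AND c1.picture = c2.picture
  AND c1.object IN (SELECT object FROM triangle)
  AND c2.object IN (SELECT object FROM triangle);

-- Every triangle or square is a polygon.
CREATE VIEW polygon AS
SELECT object FROM triangle
UNION
SELECT object FROM square;
""")

pairs = sorted(conn.execute("SELECT object1, object2 FROM doubletriangle").fetchall())
polygons = sorted(row[0] for row in conn.execute("SELECT object FROM polygon"))
```

Note that the view needs the explicit condition on picture, whereas the Prolog clause does not: in the interpretation-based representation each picture is already a separate set of facts.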

In this example the background knowledge is in a sense redundant: it is com- 
puted from the other relations. This is not necessarily the case. The following 
example illustrates this. It is also a more realistic example of an application where 
mining multiple relations is useful. 

Example 3 Assume that one has a relational database describing molecules. The 
molecules themselves are described by listing the atoms and bonds that occur in them, 
as well as some properties of the molecule as a whole. Mendeleev's periodic table of
elements is a good example of background knowledge about this domain.

The following tables illustrate what such a chemical database could look like: 



MENDELEV
number  symbol  atomic weight  electrons in outer layer
1       H       1.0079         1
2       He      4.0026         2
3       Li      6.941          1
4       Be      9.0121         2
5       B       10.811         3
6       C       12.011         4



MOLECULES
formula  name             class
H2O      water            inorganic
CO2      carbon dioxide   inorganic
CO       carbon monoxide  inorganic
CH4      methane          organic
CH3OH    methanol         organic

CONTAINS
molecule  atom_id
H2O       h2o-1
H2O       h2o-2
H2O       h2o-3
CO2       co2-1
CO2       co2-2

ATOMS
atom_id  element
h2o-1    H
h2o-2    O
h2o-3    H
co2-1    O

BONDS
atom_id1  atom_id2  type
h2o-1     h2o-2     single
h2o-2     h2o-3     single
co2-1     co2-2     double
co2-2     co2-3     double


A possible classification problem here is to classify unseen molecules into organic 
and inorganic molecules, based on their chemical structure. 

Notice that this representation of examples and background knowledge upgrades 
the typical attribute value learning representation in two respects. First, in attribute 
value learning an example corresponds to a single tuple for a single relation. Our 
representation allows for multiple tuples in multiple relations. Second, it also allows 
for using background knowledge. 

By joining all the relations in a database into one huge relation, one can of 
course eliminate the need for learning from multiple relations. The above example 
should make clear that in many cases this is not an option. The information in 
Mendeleev's table, for instance, would be duplicated many times. Moreover, unless
a multiple-instance learner is used (see e.g. (Dietterich et al., 1997)) all the atoms a
molecule consists of, together with their properties, have to be stored in one tuple, 
so that an indefinite number of attributes is needed; see (De Raedt, 1998) for a 
more detailed discussion. 

While mining such a database is not feasible using propositional techniques, it 
is feasible using learning from interpretations. We proceed to show how a relational 
database can be converted into a suitable format. 

Conversion from relational database to interpretations 

Converting a relational database to a set of interpretations can be done easily and
in a semi-automated way, as follows:

Decide which relations are background knowledge.
Let DB be the original database without the background relations.
Choose an attribute in a relation that uniquely identifies the examples.
For each value i of that attribute:
    S := set of all tuples in DB containing that value
    repeat
        S := S ∪ set of all tuples in DB referred to by a foreign key in S
    until S does not change anymore
    S_i := S

The tuples in $S$ are here assumed to be labelled with the name of the relation
they are part of. A tuple $(attr_1, \ldots, attr_n)$ of a relation $R$ can trivially be converted
to a fact $R(attr_1, \ldots, attr_n)$. By doing this conversion for all $S_i$, each $S_i$ becomes
a set of facts describing an individual example $i$. The extensional background
relations can be converted in the same manner into one set of facts that forms the
background knowledge. Background relations defined by views can be converted to
equivalent Prolog programs.

The only parts in this conversion process that are hard to automate are the 
selection of the background knowledge (typically, one selects those relations where 
each tuple can be relevant for many examples) and the conversion of view definitions 
to Prolog programs. Also, the user must indicate which attribute should be chosen 
as an example identifier, as this depends on the learning task. 

Example 4 In the chemical database, we choose as example identifier the molecular 
formula. The background knowledge consists of the table MENDELEV. In order 
to build a description of H2O, one first collects the tuples containing H2O; these 
are present in MOLECULES and CONTAINS. These tuples contain references to 
atom_ids h2o-i, i = 1, 2, 3, so the tuples containing those symbols are also collected
(tuples from ATOMS and BONDS). These again refer to the elements H and O, 
which are foreign keys for the MENDELEV relation. Since this relation is in the 
background, no further tuples are collected. Converting the tuples to facts, we get 
the following description of H2O:

{molecules('H2O', water, inorganic), contains('H2O', h2o-1), contains('H2O', h2o-2),
contains('H2O', h2o-3), atoms(h2o-1, 'H'), atoms(h2o-2, 'O'), atoms(h2o-3, 'H'),
bonds(h2o-1, h2o-2, single), bonds(h2o-2, h2o-3, single)}
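
The conversion procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the database is a dictionary of tuple lists, the names extract_example, key_columns and to_fact are hypothetical, and links between tuples are followed through shared identifier values in the listed key columns (which is how Example 4 collects both ATOMS and BONDS tuples). The sample rows follow the tables above where they are legible; the element of co2-1 is filled in for illustration.

```python
def extract_example(db, key_columns, example_id):
    """Collect all tuples of db that belong to the example example_id."""
    # Initial set S: every tuple that contains the identifier value.
    selected = {(rel, row) for rel, rows in db.items()
                for row in rows if example_id in row}
    changed = True
    while changed:                      # repeat ... until S does not change
        changed = False
        keys = {row[i] for rel, row in selected
                for i in key_columns.get(rel, [])}
        for rel, rows in db.items():
            for row in rows:
                linked = any(row[i] in keys for i in key_columns.get(rel, []))
                if linked and (rel, row) not in selected:
                    selected.add((rel, row))
                    changed = True
    return selected


def to_fact(relation, row):
    """Render a tuple of relation R as the fact R(attr_1, ..., attr_n)."""
    return f"{relation}({', '.join(row)})"


# DB without the background relation MENDELEV.
db = {
    "molecules": [("H2O", "water", "inorganic"),
                  ("CO2", "carbon dioxide", "inorganic")],
    "contains": [("H2O", "h2o-1"), ("H2O", "h2o-2"), ("H2O", "h2o-3"),
                 ("CO2", "co2-1"), ("CO2", "co2-2")],
    "atoms": [("h2o-1", "H"), ("h2o-2", "O"), ("h2o-3", "H"), ("co2-1", "O")],
    "bonds": [("h2o-1", "h2o-2", "single"), ("h2o-2", "h2o-3", "single"),
              ("co2-1", "co2-2", "double"), ("co2-2", "co2-3", "double")],
}
# Columns holding atom identifiers; the element column is not followed,
# because it refers to the background relation.
key_columns = {"contains": [1], "atoms": [0], "bonds": [0, 1]}

h2o = extract_example(db, key_columns, "H2O")
h2o_facts = sorted(to_fact(rel, row) for rel, row in h2o)
```

The resulting set contains the nine facts listed in Example 4 and nothing from the CO2 molecule, which illustrates the localization of information discussed below.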

Some variations of this algorithm can be considered. For instance, when the 
example identifier has no meaning except that it identifies the example (as the 
picture numbers 1 and 2 for the Bongard example), this attribute can be left out 
from the example description. 

The key notion in this conversion process is localization of information. It is 
assumed that for each example only a relatively small part of the database is rele- 
vant, and that this part can be localized and extracted. From now on, we will refer 
to this assumption as the locality assumption. 

2.3 The standard ILP setting 

We now briefly discuss the standard ILP setting and how it differs from our setting. 
For a more thorough discussion of different ILP settings and the relationships among 
them we refer to (De Raedt, 1997). 

The standard ILP setting (also known as learning from entailment) is usually 
formulated as follows: 

Given: 

• a set of positive examples $E^+$ and a set of negative examples $E^-$

• and a background theory $B$,

Find: a hypothesis $H$ (a Prolog program), such that

• $\forall e \in E^+ : H \land B \models e$, and

• $\forall e \in E^- : H \land B \not\models e$

Note that in this setting, an example $e$ is a fact (or clause) that is to be explained
by $H \land B$, while in the learning from interpretations setting a property
of the example (its class) is to be explained by $H \land B \land e$. Thus, the latter setting
makes explicit the separation between example-specific information and general
background information.

The problem specification as given above is natural for the standard ILP setting, 
where one could, for instance, give the following examples for the predicate member: 
+ : member(a, [a,b,c]).
+ : member(d, [e,d,c,b]).
+ : member(d, [d,c,b]).
- : member(b, [a,c,d]).
- : member(a, []).
- : member(d, [c,b]).

and expect the ILP system to come up with the following definition:

member(X, [X|Y]).
member(X, [Y|Z]) :- member(X,Z).

Note that the class of an example (i.e., its truth value) now depends on the class 
of other examples; e.g., the class of member(d, [e,d,c,b]) depends on the class
of member(d, [d,c,b]), which is a different example. Because of this property, it
is in general not possible to find a small subset of the database that is relevant for 
a single example, i.e., local coverage tests cannot be used. Results from computa- 
tional learning theory confirm that learning hypotheses in this setting generally is 
intractable (see e.g. (Dzeroski et al., 1992; Cohen, 1995; Cohen and Page, 1995)). 
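
The dependency between examples can be made visible with a short Python sketch that mimics the two clauses of member and records every member goal visited while proving one example. The function name member_subgoals is hypothetical and the code only illustrates the argument; it is not an ILP coverage test.

```python
def member_subgoals(x, items):
    """Return (truth value, list of member/2 goals visited) for member(x, items)."""
    visited = []
    while True:
        visited.append((x, list(items)))
        if not items:
            return False, visited      # no clause matches member(X, [])
        if items[0] == x:
            return True, visited       # first clause: member(X, [X|Y])
        items = items[1:]              # second clause: recurse on the tail


covered, subgoals = member_subgoals("d", ["e", "d", "c", "b"])
```

Here subgoals contains the goal member(d, [d,c,b]) in addition to the example itself, so deciding the class of one example involves the class of another one.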

Since in learning from interpretations the class of an example is assumed to
be independent of other examples, this setting is less powerful than the standard
ILP setting (e.g., with respect to recursion). With this loss of power comes a
gain in efficiency, through local coverage tests. The interesting point is that the
full power of standard ILP is not used in most practical applications, for which
learning from interpretations usually turns out to be sufficient; see
e.g. the proceedings of the ILP workshops and conferences of the last few years (De
Raedt, 1996; Muggleton, 1997; Lavrac and Dzeroski, 1997; Page, 1998).



3 Tilde: Induction of First-Order Logical Decision Trees

In this section, we discuss one specific ILP system that learns from interpretations, 
called Tilde (which stands for Top-down Induction of Logical DEcision trees) . This 
system will be used to illustrate the topics discussed in the following sections. 

We first introduce the hypothesis representation formalism used by Tilde, then 
discuss an algorithm for the induction of hypotheses in this formalism. 

3.1 First order logical decision trees 

We will use first order logical decision trees for representing hypotheses. These are 
an upgrade of the well-known propositional decision trees to first order learning. 
A first order logical decision tree (FOLDT) is a binary decision tree in which 

• the nodes of the tree contain a conjunction of literals 

• different nodes may share variables, under the following restriction: a variable 
that is introduced in a node (which means that it does not occur in higher 
nodes) must not occur in the right branch of that node. The need for this 
restriction follows from the semantics of the tree. A variable X that is in- 
troduced in a node, is quantified existentially within the conjunction of that 
node. The right subtree is only relevant when the conjunction fails ("there is
no such X"), in which case further reference to X is meaningless.

An example of such a tree is shown in Figure 2.

Figure 2: A first order logical decision tree that discriminates between the two
classes for the Bongard problem shown in Figure 1. (Tree diagram not reproduced;
its root node tests triangle(X).)

First order logical decision trees can be converted to normal logic programs (i.e. 
logic programs that allow negated literals in the body of a clause) and to Prolog 
programs. In the latter case the Prolog program represents a first order decision 
list, i.e. an ordered set of rules where a rule is only relevant if none of the rules 
before it succeed. Each clause in such a Prolog program ends with a cut. We refer 
to (Blockeel and De Raedt, 1998) for more information on the relationship between 
first order decision trees, first order decision lists and logic programs. 

The Prolog program equivalent to the tree in Figure 2 is²

class(pos) :- triangle(X), inside(X,Y), !.
class(neg) :- triangle(X), !.
class(neg).

Figure 3 shows how to use FOLDTs for classification. We use the following
notation: a tree $T$ is either a leaf with class $c$, in which case we write $T = leaf(c)$,
or it is an internal node with conjunction conj, left branch left and right branch
right, in which case we write $T = inode(conj, left, right)$.

Because an example $e$ is a Prolog program, a test in a node corresponds to
checking whether a query $\leftarrow C$ succeeds in $e \land B$ (with $B$ the background
knowledge). Note that it is not sufficient to use for $C$ the conjunction conj in the node
itself. Since conj may share variables with nodes higher in the tree, $C$ consists
of several conjunctions that occur in the path from the root to the current node.
More specifically, $C$ is of the form $Q \land conj$, where $Q$ is the conjunction of all the
conditions that occur in those nodes on the path from the root to this node where
the left branch was chosen. We call $Q$ the associated query of the node.

When an example is sorted to the left, Q is updated by adding conj to it. 
When sorting an example to the right, Q need not be updated: a failed test never 
introduces new variables. E.g., if in Figure 2 an example is sorted down the tree, in
the node containing inside(X,Y) the correct test is triangle(X), inside(X,Y);
it is not correct to test inside(X,Y) on its own.
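
The role of the associated query can be illustrated with a small Python sketch of the classification procedure. It is a simplification under stated assumptions: interpretations are sets of tuples, variables are strings starting with a capital, the background knowledge B is left empty, and the helper names solve, succeeds and classify as well as the tuple encoding of trees are hypothetical. The tree encodes the Prolog program given above for Figure 2.

```python
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def solve(literals, facts, binding=None):
    """Yield every variable binding under which all literals hold in facts."""
    binding = {} if binding is None else binding
    if not literals:
        yield binding
        return
    first, rest = literals[0], literals[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new = dict(binding)
        for arg, value in zip(first[1:], fact[1:]):
            if is_var(arg):
                if new.setdefault(arg, value) != value:
                    break               # variable already bound differently
            elif arg != value:
                break                   # constant mismatch
        else:
            yield from solve(rest, facts, new)


def succeeds(literals, facts):
    return next(solve(literals, facts), None) is not None


def classify(tree, facts):
    query = []                          # associated query Q, initially true
    node = tree
    while node[0] != "leaf":
        _, conj, left, right = node
        if succeeds(query + conj, facts):
            query = query + conj        # Q := Q and conj
            node = left
        else:
            node = right                # Q is left unchanged
    return node[1]


TREE = ("inode", [("triangle", "X")],
        ("inode", [("inside", "X", "Y")], ("leaf", "pos"), ("leaf", "neg")),
        ("leaf", "neg"))

picture1 = {("circle", "o1"), ("triangle", "o2"),
            ("points", "o2", "up"), ("inside", "o2", "o1")}
# Hypothetical picture: a triangle, plus a circle inside a square.
tricky = {("triangle", "t"), ("circle", "c"), ("square", "s"),
          ("inside", "c", "s")}
```

For the hypothetical picture, inside(X,Y) on its own succeeds (the circle is inside the square), but triangle(X), inside(X,Y) fails, so the tree sends it to the right branch of the second node.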

3.2 The Tilde system 

First order logical decision trees can be induced in very much the same manner as 
propositional decision trees. The generic algorithm for this is usually referred to 

² The Prolog program entails class(c) instead of c, in order to ensure that the cuts have the
intended meaning; this is a purely syntactic difference from the original task formulation.
procedure classify(e : example) returns class:
    Q := true
    N := root
    while N ≠ leaf(c) do
        let N = inode(conj, left, right)
        if Q ∧ conj succeeds in e ∧ B
        then Q := Q ∧ conj
             N := left
        else N := right
    return c

Figure 3: Classification of an example using an FOLDT (with background knowledge B)



procedure buildtree(T: tree, E: set of examples, Q: query):
    ← Q_b := element of ρ(← Q) with highest gain (or gain ratio)
    if ← Q_b is not good /* e.g. does not yield any gain at all */
    then T := leaf(majority_class(E))
    else
        conj := Q_b − Q
        E_1 := {e ∈ E | ← Q_b succeeds in e ∧ B}
        E_2 := {e ∈ E | ← Q_b fails in e ∧ B}
        buildtree(left, E_1, Q_b)
        buildtree(right, E_2, Q)
        T := inode(conj, left, right)

procedure Tilde(T: tree, E: set of examples):
    buildtree(T, E, true)

Figure 4: Algorithm for first-order logical decision tree induction
as TDIDT: top-down induction of decision trees. Examples of systems using this
approach are C4.5 (Quinlan, 1993a) and CART (Breiman et al., 1984). 

The algorithm we use for inducing first order decision trees is shown in Figure 4.
The Tilde system (Blockeel and De Raedt, 1998) is an implementation of this 
algorithm that is based on C4.5. It uses the same heuristics, the same post-pruning 
algorithm, etc. 

The main point where our algorithm differs from C4.5 is in the computation of 
the set of tests to be considered at a node. C4.5 only considers tests comparing 
an attribute with a value. Tilde, on the other hand, generates possible tests by 
means of a user-defined refinement operator. Roughly, this operator specifies, given 
the associated query of a node, which literals or conjunctions can be added to the 
query. 

More specifically, the refinement operator is a refinement operator under $\theta$-subsumption
(Plotkin, 1970; Muggleton and De Raedt, 1994). Such an operator $\rho$ maps
clauses onto sets of clauses, such that for any clause $c$ and $\forall c' \in \rho(c)$, $c$ $\theta$-subsumes
$c'$. A clause $c_1$ $\theta$-subsumes another clause $c_2$ if and only if there exists a variable
substitution $\theta$ such that $c_1\theta \subseteq c_2$. The operator could for instance add literals to
the clause, or unify several variables in it. The use of such refinement operators is
standard practice in ILP.

In order to refine a node with associated query $Q$, Tilde computes $\rho(\leftarrow Q)$
and chooses the query $\leftarrow Q_b \in \rho(\leftarrow Q)$ that results in the best split. The best split
is the one that maximizes a certain quality criterion; in the case of Tilde this is by
default the information gain ratio, as defined by Quinlan (1993a). The conjunction
put in the node consists of $Q_b - Q$, i.e., the literals that have been added to $Q$ in
order to produce $Q_b$.

Example 5 Consider the tree in Figure 2. Assuming that the root node has already
been filled in with the test triangle(X), how does Tilde process the left child
of it? This child has as associated query $\leftarrow$ triangle(X). Tilde now generates
$\rho(\leftarrow \mathit{triangle}(X))$. According to the language bias specified by the user (see below),
a possible result could be (we use semicolons to separate the elements of $\rho$, as the
comma denotes a conjunction in Prolog)

$\rho(\leftarrow \mathit{triangle}(X))$ = { $\leftarrow$ triangle(X), inside(X,Y);
                                            $\leftarrow$ triangle(X), inside(Y,X);
                                            $\leftarrow$ triangle(X), square(Y);
                                            $\leftarrow$ triangle(X), circle(Y) }

Assuming the best of these refinements is $\leftarrow Q_b$ = $\leftarrow$ triangle(X), inside(X,Y), the
conjunction put in the node is $Q_b - Q$ = inside(X,Y).
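
The selection of the best refinement can be sketched as follows. The sketch makes several assumptions: it scores candidates by plain information gain rather than the gain ratio that Tilde uses by default, it ignores background knowledge, the four labelled pictures are invented for illustration, and the helper names are hypothetical.

```python
import math
from collections import Counter


def solve(literals, facts, binding=None):
    """Yield bindings under which all literals hold (capitalized = variable)."""
    binding = {} if binding is None else binding
    if not literals:
        yield binding
        return
    first, rest = literals[0], literals[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new = dict(binding)
        for arg, value in zip(first[1:], fact[1:]):
            if arg[:1].isupper():
                if new.setdefault(arg, value) != value:
                    break
            elif arg != value:
                break
        else:
            yield from solve(rest, facts, new)


def succeeds(literals, facts):
    return next(solve(literals, facts), None) is not None


def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(labels).values())


def gain(examples, query):
    """Information gain of splitting examples on success of query."""
    left = [c for facts, c in examples if succeeds(query, facts)]
    right = [c for facts, c in examples if not succeeds(query, facts)]
    labels = [c for _, c in examples]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Invented examples that all reach the left child of the root (each has a triangle).
examples = [
    ({("triangle", "a"), ("circle", "b"), ("inside", "a", "b")}, "pos"),
    ({("triangle", "a"), ("square", "b"), ("inside", "a", "b")}, "pos"),
    ({("triangle", "a"), ("circle", "b"), ("inside", "b", "a")}, "neg"),
    ({("triangle", "a"), ("circle", "b")}, "neg"),
]
Q = [("triangle", "X")]
candidates = [                          # the four refinements of Example 5
    Q + [("inside", "X", "Y")],
    Q + [("inside", "Y", "X")],
    Q + [("square", "Y")],
    Q + [("circle", "Y")],
]
best = max(candidates, key=lambda q: gain(examples, q))
conj = [literal for literal in best if literal not in Q]   # Q_b - Q
```

On these invented examples the refinement triangle(X), inside(X,Y) separates the classes perfectly, so the conjunction placed in the node is inside(X,Y), as in Example 5.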

Language bias 

While propositional systems usually have a fixed language bias, most ILP systems
make use of a language bias that has been provided by the user. The language bias 
specifies what kind of hypotheses are allowed; in the case of Tilde: what kind of 
literals or conjunctions of literals can be put in the nodes of the tree. This bias 
follows from the refinement operator, so it is sufficient to specify the latter. The 
specific refinement operator that is to be used is defined by the user in a Progol- 
like manner (Muggleton, 1995). A set of facts of the form rmode(n: conjunction) 
is provided, indicating which conjunctions can be added to a query, the maximal 
number of times the conjunction can be added (i.e. the maximal number of times it 
can occur in any path from root to leaf, n), and the modes and types of its variables. 

To illustrate this, we return to the example of the Bongard problems. A suitable 
refinement operator definition in this case would be 
rmode(5: triangle(+-V)).
rmode(5: square(+-V)).
rmode(5: circle(+-V)).
rmode(5: inside(+V,+-W)).
rmode(5: inside(-V,+W)).
rmode(5: config(+V,up)).
rmode(5: config(+V,down)).



The mode of an argument is indicated by a +, - or +- sign before a variable.
+ stands for input: the variable should already occur in the associated query of the
node where the test is put. - stands for output: the variable has to be one that
does not occur yet. +- means that the argument can be both input and output;
i.e. the variable can be a new one or an already existing one. Note that the names 
of the variables in the rmode facts are formal names; when the literal is added to 
a clause actual variable names are substituted for them. Also note that a literal 
can have multiple modes, e.g. the above facts specify that at least one of the two 
arguments of inside has to be input. 

This rmode definition tells Tilde that a test in a node may consist of checking
whether an object that has already been referred to has a certain shape (e.g.
triangle(X) with X an already existing variable), checking whether there exists an
object with a certain shape in the picture (e.g. triangle(Y) with Y not occurring
in the associated query), testing the configuration (up or down) of a certain object, 
and so on. At most 5 literals of a certain type can occur on any path from root to 
leaf (this is indicated by the 5 in the rmode facts). 

The decision tree shown in Figure 2 conforms to this specification. When Tilde
builds this tree, in the root node only the tests triangle(X), square(X) and
circle(X) are considered, because each other test requires some variable to occur
in the associated query of the node (which for the root node is true). The left
child node of the root has as associated query $\leftarrow$ triangle(X), which contains one
variable X, hence the tests that are considered for this node are:

triangle(X)   triangle(Y)   inside(X,Y)   points(X,up)
square(X)     square(Y)     inside(Y,X)   points(X,down)
circle(X)     circle(Y)

Assuming that inside(X,Y) yields the best split, this literal is put in the node. 
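
How such a list of candidate tests follows from the mode declarations can be sketched in Python. The encoding of rmodes as (predicate, modes) pairs and the names fresh_variable and refinements are assumptions for illustration. The sketch also assumes that a variable is not used twice within one literal and that at most one new variable is introduced per literal, it ignores the bound n on the number of occurrences, and it uses the predicate name points, as in the candidate tests listed above.

```python
from itertools import product

# Each entry: predicate name and, per argument, a mode ("+", "-", "+-")
# or a constant such as "up".
RMODES = [
    ("triangle", ["+-"]),
    ("square", ["+-"]),
    ("circle", ["+-"]),
    ("inside", ["+", "+-"]),
    ("inside", ["-", "+"]),
    ("points", ["+", "up"]),
    ("points", ["+", "down"]),
]


def fresh_variable(used):
    for name in "XYZUVW":
        if name not in used:
            return name
    raise ValueError("no unused variable name left")


def refinements(query_vars, rmodes=RMODES):
    """Literals that may be added to a query whose variables are query_vars."""
    new_var = fresh_variable(query_vars)
    result = []
    for pred, modes in rmodes:
        choices = []
        for mode in modes:
            if mode == "+":
                choices.append(list(query_vars))             # existing variable
            elif mode == "-":
                choices.append([new_var])                    # new variable
            elif mode == "+-":
                choices.append(list(query_vars) + [new_var])
            else:
                choices.append([mode])                       # constant
        for args in product(*choices):
            variables = [a for a in args if a[:1].isupper()]
            if len(set(variables)) != len(variables):
                continue        # assumption: no repeated variable in a literal
            literal = (pred,) + args
            if literal not in result:
                result.append(literal)
    return result


root_tests = refinements([])          # associated query of the root is true
child_tests = refinements(["X"])      # associated query: triangle(X)
```

With an empty associated query only the three shape tests are produced, and with the single variable X the ten tests listed above are produced.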

In addition to rmodes, so-called lookahead specifications can be provided. These 
allow Tilde to perform several successive refinement steps at once. This alleviates 
the well-known problem in ILP (see e.g. (Quinlan, 1993b)) that a refinement may 
not yield any gain, but may introduce new variables that are crucial for classification.
By performing successive refinement steps at once, Tilde can look ahead in
the refinement lattice and discover such situations. 

For instance, lookahead(triangle(T), points(T,up)) specifies that whenever
the literal triangle(T) is considered as a possible addition to the current associated
query, additional refinement by adding points(T,up) should be tried in the
same refinement step. Thus, both triangle(T) and triangle(T), points(T,up)
would be considered as possible additions. This is useful because normally Tilde can
construct the test triangle(T), points(T,up) only by first putting triangle(T)
in the node, then putting points(T,up) in its left child node. But if triangle(X)
already occurs in the associated query, then triangle(T) cannot yield any gain (if
you already know that there is a triangle, the question "is there a triangle" will 
not give you new information) and hence would never be selected, and this would 
prevent points (T, up) from being added as well. 
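The effect of a lookahead specification on the set of candidate refinements can be
sketched as follows. This is only an illustration under simplifying assumptions, not
Tilde's implementation: literals are plain strings, and the names refinements and
lookahead are hypothetical.

```python
def refinements(candidate_literals, lookahead):
    """Return the candidate conjunctions (tuples of literals) for one refinement step.

    candidate_literals: literals allowed by the language bias (hypothetical input).
    lookahead: dict mapping a literal to the literals that may be added
               together with it in the same refinement step.
    """
    result = []
    for lit in candidate_literals:
        result.append((lit,))                  # the literal on its own
        for extra in lookahead.get(lit, []):
            result.append((lit, extra))        # the literal plus its lookahead literal
    return result


candidates = refinements(
    ["triangle(T)", "square(T)"],
    {"triangle(T)": ["points(T,up)"]},
)
```

Here candidates contains triangle(T) alone, the conjunction triangle(T), points(T,up),
and square(T) alone, so the conjunction can be evaluated even when triangle(T) by
itself yields no gain.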

This lookahead method is very similar to lookahead methods that have been 
proposed for propositional decision tree learners. While for propositional systems 
the advantage of lookahead is generally considered to be marginal, it is much greater
in ILP because of the occurrence of variables. 

We finally mention that Tilde handles numerical data by means of a discretiza- 
tion algorithm that is based on Fayyad and Irani's (1993) and Dougherty et al.'s 
(1995) work, but extends it to first order logic (Van Laer et al., 1997). The algorithm
accepts input of the form discretize(Query, Var), with Var a variable
occurring in Query. It runs Query in all the examples, collecting all instantiations 
of Var that can be found, and finally generates discretization thresholds based on 
this set of instantiations. Since this discretization procedure is not crucial to this 
paper, we refer to (Van Laer et al., 1997; Blockeel and De Raedt, 1997) for more 
details. 

Input Format 

A data set is presented to Tilde in the form of a set of interpretations. Each 
interpretation consists of a number of Prolog facts, surrounded by a begin and end 
line. The background knowledge is simply a Prolog program. Examples of this will
be shown in Section 5.
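Reading such a file one interpretation at a time can be sketched as below. The sketch
assumes one fact per line and begin(model(Id)) / end(model(Id)) delimiter lines, as in
the example figures of this paper; the function name read_interpretations is
hypothetical and the code is not part of Tilde.

```python
def read_interpretations(lines):
    """Yield (model_id, facts) pairs, one interpretation at a time."""
    model_id, facts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("begin(model("):
            # the identifier sits between "begin(model(" and the closing "))"
            model_id = line[len("begin(model("):line.index("))")]
            facts = []
        elif line.startswith("end(model("):
            yield model_id, facts
            model_id, facts = None, []
        elif model_id is not None:
            facts.append(line.rstrip("."))     # drop the terminating period


text = """
begin(model(4)).
card(7,spades).
card(queen,hearts).
pair.
end(model(4)).
"""
models = list(read_interpretations(text.splitlines()))
```

Because the function is a generator, only the interpretation currently being read
needs to be held in memory.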

Applications of Tilde 

Although the above discussion of Tilde takes the viewpoint of induction of clas- 
sifiers, the use of first order logical decision trees is not limited to classification. 
Numerical predictions can be made by storing numbers instead of classes in the 
leaves; such trees are usually called regression trees. Another task that is impor- 
tant for data mining, is clustering. Induction of cluster hierarchies can also be done 
using a TDIDT approach, as is explained in (Blockeel et al., 1998). 

It should be clear, therefore, that the techniques that will be described later in 
this text should not be seen as specific for the classification context. They have a 
much broader application domain. 

4 Upgrading Propositional KDD Techniques for Tilde

In this section we discuss how existing propositional KDD techniques can be up- 
graded to first order learning in our setting. The Tilde system will serve as a case 
study here. Indeed, all of the techniques proposed below (except sampling) have 
been implemented in Tilde. We stress, however, that the methodology of upgrad- 
ing KDD techniques is not specific for Tilde, nor for induction of decision trees. It 
can also be used for rule induction, discovery of association rules, and other kinds 
of discovery. Systems such as Claudien (De Raedt and Dehaspe, 1997), ICL (De
Raedt and Van Laer, 1995) and Warmr (Dehaspe and De Raedt, 1997) are illustrations
of this. All of them learn from interpretations and upgrade propositional techniques.
ICL learns first order rule sets, upgrading the techniques used in CN2, and Warmr 
learns a first order equivalent of association rules ( "association rules over multiple 
relations"). Warmr has been designed specifically for large databases and employs 
an efficient algorithm that is an upgrade of Apriori (Agrawal et al., 1996). 

4.1 Different Implementations of Tilde 

We discuss two different implementations of Tilde. One is a straightforward
implementation that closely follows the TDIDT algorithm. The other is a more
sophisticated implementation that aims specifically at handling large data sets; it is
based on work by Mehta et al. (1996), and as such is our first example of how
propositional techniques can be upgraded.

    for each refinement ← Qi:
        /* counter[true] and counter[false] are class distributions,
           i.e. arrays mapping classes onto their frequencies */
        for each class c: counter[true][c] := 0, counter[false][c] := 0
        for each example e:
            if ← Qi succeeds in e
            then increase counter[true][class(e)] by 1
            else increase counter[false][class(e)] by 1
        si := weighted_average_class_entropy(counter[true], counter[false])
    Qb := that Qi for which si is minimal   /* highest gain */

Figure 5: Computation of the best test Qb in Tildeclassic.

4.1.1 A straightforward implementation: Tildeclassic

The original Tilde implementation, which we will refer to as Tildeclassic, is based
on the basic TDIDT algorithm shown in an earlier figure. This is the most
straightforward way of implementing TDIDT.

Noteworthy characteristics are that the tree is built depth-first, and that the
best test is chosen by enumerating the possible tests and computing the quality of
each of them (to this aim the test needs to be evaluated on every single example), as
is shown in Figure 5. This algorithm should be seen as a detailed description of line
6 of the TDIDT algorithm.
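The computation in Figure 5 can be mirrored in a few lines of Python. This is a sketch
under simplifying assumptions, not the Prolog implementation: examples are
(example, class) pairs held in memory, a test is an ordinary predicate on an example,
and all function names are ours.

```python
from collections import Counter
from math import log2


def entropy(dist):
    """Entropy of a non-empty class distribution given as a Counter of frequencies."""
    total = sum(dist.values())
    return -sum(c / total * log2(c / total) for c in dist.values() if c)


def weighted_average_class_entropy(left, right):
    """Average the entropies of two class distributions, weighted by their sizes."""
    n_left, n_right = sum(left.values()), sum(right.values())
    total = n_left + n_right
    result = 0.0
    if n_left:
        result += n_left / total * entropy(left)
    if n_right:
        result += n_right / total * entropy(right)
    return result


def best_test(tests, examples):
    """tests: dict name -> predicate(example); examples: list of (example, class)."""
    best_name, best_score = None, float("inf")
    for name, succeeds in tests.items():           # refinement loop (outer)
        counter = {True: Counter(), False: Counter()}
        for example, cls in examples:              # example loop (inner)
            counter[bool(succeeds(example))][cls] += 1
        score = weighted_average_class_entropy(counter[True], counter[False])
        if score < best_score:                     # minimal entropy = highest gain
            best_name, best_score = name, score
    return best_name, best_score


# Toy data: each example is the set of shapes it contains (an assumption for illustration).
examples = [
    ({"triangle", "circle"}, "pos"),
    ({"triangle"}, "pos"),
    ({"square"}, "neg"),
    ({"circle", "square"}, "neg"),
]
tests = {
    "has_triangle": lambda e: "triangle" in e,
    "has_circle": lambda e: "circle" in e,
}
name, score = best_test(tests, examples)
```

On this toy data the test has_triangle separates the classes perfectly and is therefore
selected. Note that every test touches every example, which is why fast access to
examples matters in this implementation.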

Note that with this implementation, it is crucial that fetching an example from 
the database in order to query it is done as efficiently as possible, because this 
operation is inside the innermost loop. For this reason, Tildeclassic loads all data
into main memory when it starts up. Localization is then achieved by using the 
module system of the Prolog engine in which Tilde runs. Each example is loaded 
into a different module, and accessing an example is done by changing the currently 
active module, which is a very cheap operation. One could also load all the examples 
into one module; no example selection is necessary then, and all data can always 
be accessed directly. The disadvantage is that the relevant data needs to be looked 
up in a large set of data, so that a good indexing scheme is necessary in order to 
make this approach efficient. We will return to this in the section on experiments. 

We point out that, when examples are loaded into different modules, Tildeclassic
partially exploits the locality assumption (in that it handles each individual exam- 
ple independently from the others, but still loads all the examples in main memory). 
It does not exploit this assumption at all when all the examples are loaded into one 
module. 

4.1.2 A more sophisticated implementation: TildeLDS

Mehta et al. (1996) proposed an alternative implementation of TDIDT that is 
oriented towards mining large databases. With their approach, the database is 
accessed less intensively, which results in an important efficiency gain. We have 
adopted this approach for an alternative implementation of Tilde, which we call 
TildeLDS (LDS stands for Large Data Sets).

The alternative algorithm is shown in Figure 6. It differs from Tildeclassic in
that the tree is now built breadth-first, and examples are loaded into main memory
one at a time.

    procedure TildeLDS:
        S := {root}
        while S ≠ ∅ do
            /* add one level to the tree */
            for each example e that is not covered by a leaf node:
                load e
                N := the node in S that covers e
                ← Q := associated_query(N)
                for each refinement ← Qi of ← Q:
                    if ← Qi succeeds in e
                    then increase counter[N,i,true][class(e)] by 1
                    else increase counter[N,i,false][class(e)] by 1
            for each node N ∈ S:
                remove N from S
                ← Qb := best_test(N)
                if ← Qb is not good
                then N := leaf(majority_class(N))
                else
                    ← Q := associated_query(N)
                    conj := Qb - Q
                    N := inode(conj, left, right)
                    add left and right to S

    function best_test(N: node) returns query:
        ← Q := associated_query(N)
        for each refinement ← Qi of ← Q:
            CDl := counter[N,i,true]
            CDr := counter[N,i,false]
            si := weighted_average_class_entropy(CDl, CDr)
        Qb := that Qi for which si is minimal
        return ← Qb

Figure 6: The TildeLDS algorithm.

The algorithm works level- wise. Each iteration through the while loop will 
expand one level of the decision tree. S contains all nodes at the current level of 
the decision tree. To expand this level, the algorithm considers all nodes N in S. 
For each node and for each refinement in that node, a separate counter (to compute 
class distributions) is kept. The algorithm makes one pass through the data, during
which for each example that belongs to a non-leaf node N it tests all refinements 
for N on the example and updates the corresponding counters. 
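The counting pass for a single level of the tree (the first inner loop of Figure 6) can
be sketched in Python as follows. The sketch assumes that examples arrive as a stream
of (example, class) pairs and that tests are ordinary predicates; the names
count_one_level, node_of and refinements_of are hypothetical.

```python
from collections import Counter, defaultdict


def count_one_level(example_stream, node_of, refinements_of):
    """One pass over the data for one level of the tree.

    example_stream: iterable yielding (example, class) pairs, e.g. read from disk.
    node_of: function mapping an example to the open node that covers it,
             or None if the example is already covered by a leaf.
    refinements_of: dict node -> dict refinement name -> predicate(example).
    Returns counter[(node, refinement, outcome)] -> Counter of classes.
    """
    counter = defaultdict(Counter)
    for example, cls in example_stream:                        # example loop (outer)
        node = node_of(example)
        if node is None:
            continue                                           # covered by a leaf: skip
        for name, succeeds in refinements_of[node].items():    # refinement loop (inner)
            counter[(node, name, bool(succeeds(example)))][cls] += 1
    return counter


# Toy data: each example is the set of shapes it contains (an assumption for illustration).
examples = [
    ({"triangle", "circle"}, "pos"),
    ({"triangle"}, "pos"),
    ({"square"}, "neg"),
    ({"circle", "square"}, "neg"),
]
refs = {"root": {
    "has_triangle": lambda e: "triangle" in e,
    "has_circle": lambda e: "circle" in e,
}}
counts = count_one_level(iter(examples), lambda e: "root", refs)
```

The stream is consumed exactly once, and the memory used by counts depends on the
number of nodes, refinements and classes, not on the number of examples.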

Note that while for Tildeclassic the example loop was inside the refinement
loop, the opposite is true now. This minimizes the number of times a new example
must be loaded, which is an expensive operation: in the previous approach all
examples were in main memory and only had to be "selected" in order to access
them, whereas examples are now loaded from disk. In the
current implementation each example needs to be loaded at most once per level of
the tree ("at most" because once it is in a leaf it need not be loaded anymore), hence
the total number of passes through the data file is equal to the depth of the tree,
which is the same as was obtained for propositional learning algorithms (Mehta et
al., 1996).

The disadvantage of this algorithm is that a four-dimensional array of counters 
needs to be stored instead of a two-dimensional one (as in Tildeclassic), because
different counters are kept for each node and for each refinement. 

Care has been taken to implement TildeLDS in such a way that the size of the
data set that can be handled is not restricted by internal memory (in contrast to
Tildeclassic). Whenever information needs to be stored whose size depends
on the size of the data set, this information is stored on disk.† When processing
a certain level of the tree, the space complexity of TildeLDS therefore contains a
component O(r · n), with n the number of nodes on that level and r the (average)
number of refinements of those nodes (because counters are kept for each refinement
in each node), but is constant in the number of examples. This contrasts
with Tildeclassic, where space complexity contains a component O(m) with m the
number of examples (because all examples are loaded at once).

While memory now restricts the number of refinements that can be considered 
in each node and the maximal size of the tree, this restriction is unimportant in 
practice, as the number of refinements and the tree size are usually much smaller 
than the upper bounds imposed by the available memory. Therefore TildeLDS
typically consumes less memory than Tildeclassic, and may be preferable even
when the whole data set can be loaded into main memory. 

4.2 Sampling 

While the above implementation is one step towards handling large data sets, there 
will always be data sets that are too large to handle. An approach that is often taken 
by data mining systems when there are too many examples, is to select a sample 
from the data and learn from that sample. Such techniques are incorporated in e.g. 
C4.5 (Quinlan, 1993a) and CART (Breiman et al., 1984). 

In the standard ILP context there are some difficulties with sampling, which 
can be ascribed to the lack of a locality assumption. When one example contains 
information that is relevant for another example, either both examples have to be 
included together in the sample, or neither of them should. Otherwise, one obtains
a sample in which some examples have an incomplete description (and hence are
noisy). It is even possible that no good sample can be drawn because all the examples
are related to one another. To the best of our knowledge sampling has received
little attention inside ILP, as is also noted by Fürnkranz (1997a) and Srinivasan
(1998).

If the locality assumption can be made, such sampling problems do not occur. 
Picking individual examples from the population in a random fashion, independently 
from one another, is sufficient to create a good sample. 
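Under the locality assumption, drawing such a sample requires nothing more than
picking interpretations at random. The sketch below uses reservoir sampling so that
the sample can be drawn in one pass over a stream of examples; this particular
technique is our choice for illustration and is not taken from any Tilde
implementation.

```python
import random


def sample_examples(example_stream, k, seed=0):
    """Uniform random sample of k examples from a stream (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, example in enumerate(example_stream):
        if i < k:
            reservoir.append(example)          # fill the reservoir first
        else:
            j = rng.randint(0, i)              # inclusive on both ends
            if j < k:
                reservoir[j] = example         # replace with probability k / (i + 1)
    return reservoir


sample = sample_examples((f"model({i})" for i in range(1000)), k=10)
```

Each example is kept or discarded as a whole, which is valid only because no example
carries information that is needed to describe another one.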

Automatic sampling has not been included in the current Tilde implementations.
We do not give this high priority because Tilde learns from a flat file of data
which is produced by extracting information from a database and putting related
information together (as explained earlier in this text). Sampling should be done at
the level of the extraction of information, not by Tilde itself. It is rather inefficient
to convert the whole database into a flat file and then use only a part of that file,
instead of only converting the part of the database that will be used.

We do not present experiments with sampling, as the effect of sampling in data 
mining is out of the scope of this paper; instead we refer to the already existing 
studies on this subject (see e.g. (Muggleton, 1993; Fürnkranz, 1997b; Srinivasan,
1999)). 

† The results of all queries for each example are stored in this manner, so that when the best
query is chosen after one pass through the data, these results can be retrieved from the auxiliary 
file, avoiding a second pass through the data. 






4.3 Internal Validation 



Internal validation means that a part of the learning set (the validation set) is kept 
apart for validation purposes, and the rest is used as the training set for building 
the hypothesis. Such a methodology is often followed for tuning parameters of a 
system or for pruning. Similar to sampling, partitioning the learning set is easy 
if the locality assumption holds, otherwise it may be hard; hence learning from 
interpretations makes it easier to incorporate validation-based techniques in an ILP
system. 
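A minimal sketch of such a partition, assuming the locality assumption holds so that
examples can be assigned to either part independently of one another. The function
name and the 30% validation fraction are hypothetical choices, not Tilde parameters.

```python
import random


def split_learning_set(examples, validation_fraction=0.3, seed=0):
    """Randomly partition the learning set into a training and a validation set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]   # (training set, validation set)


training, validation = split_learning_set(range(188))
```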

4.4 Scalability 

De Raedt and Džeroski (1994) have shown that in the learning from interpretations
setting, learning first-order clausal theories is tractable. More specifically, given 
fixed bounds on the maximal length of clauses and the maximal arity of literals, 
such theories are polynomial-sample polynomial-time PAC-learnable. This positive 
result is related directly to the learning from interpretations setting. 

Quinlan (1986) has shown that induction of decision trees has time complexity 
O(a · N · n), where a is the number of attributes of each example, N is the number
of examples and n is the number of nodes in the tree. Since Tilde uses basically 
the same algorithm as Quinlan, it inherits the linearity in the number of examples 
and in the number of nodes. The main difference between Tilde and C4.5, as we 
already noted, is the generation of tests in a node. 

The number of tests to be considered in a node depends on the refinement 
operator. There is no theoretical bound on this, as it is possible to define refinement 
operators that cause an infinite branching factor. In practice, useful refinement 
operators always generate a finite number of refinements, but even then this number 
may not be bounded; the number of refinements typically increases with the length 
of the associated query of the node. Also, the time for performing one single test 
on a single example depends on the complexity of that test (it is in the worst case 
exponential in the length of the conjunction). 

Thus, we can say that induction of first order decision trees has time complexity 
O(N · n · t · c), with t the average number of tests performed in each node and c the
average time complexity of performing one test for one example, if those averages 
exist. If one is willing to accept an upper bound on the complexity of the theory 
that is to be learned (which was done for the PAC-learning results) and defines 
a finite refinement operator, both the complexity of performing a single test on a 
single example and the number of tests are bounded and the averages do exist. 

Our main conclusion from this is that the time complexity of Tilde is linear
in the number of examples. This is a stronger claim than can be made for the 
standard ILP setting. The time complexity also depends on the global complexity 
of the theory and the branching factor of the refinement operator, which are domain- 
dependent parameters. 

5 Experiments 

In this experimental section we try to validate our claims about time complexity 
empirically, and explore some influences on scalability. More specifically, we want 
to: 

• validate the claim that when the localization assumption is exploited, induc- 
tion time is linear in the number of examples (all other things being equal, 
i.e. we control for other influences on induction time such as the size of the 
tree) 
• study the influence of localization on induction time (by quantifying the
amount of localization and investigating its effect on the time complexity) 

• investigate how the induction time varies with the size of the data set in more 
practical situations (if we do not control other influences; i.e. a larger data 
set may cause the learner to induce a more complex theory, which in itself has 
an effect on the induction time) 

Before discussing the experiments themselves, we describe the data sets that we 
have used. 

5.1 Description of the Data Sets 

5.1.1 RoboCup Data Set 

This is a data set containing data about soccer games played by software agents 
training for the RoboCup competition (Kitano et al., 1997). It contains 88594 
examples and is 100 MB in size. Each example consists of a description of the state
of the soccer terrain as observed by one specific player at a single moment. This
description includes the identity of the player, the positions of all players and of the
ball, the time at which the example was recorded, the action the player performed,
and the time at which this action was executed. Figure 7 shows one example.

While this data set would allow rather complicated theories to be constructed, 
for our experiments the language bias was very simple and consisted of a proposi- 
tional language (only high-level commands are learned). This use of the data set 
reflects the learning tasks considered up till now by the people who are using it, see 
(Jacobs et al., 1998). This does not influence the validity of our results for relational 
languages, because the propositions are defined by the background knowledge and 
their truth values are computed at runtime, so the query that is really executed is 
relational. For instance, the proposition have_ball, indicating whether some player 
of the team has the ball in its possession, is computed from the position of the player 
and of the ball. 

5.1.2 Poker Data Sets 

The Poker data sets are artificially created data sets where each example is a
description of a hand of five cards, together with a name for the hand (pair, three of
a kind, ...). The aim is to learn definitions for several poker concepts from a set of
examples. The classes that are considered here are nothing, pair, two_pairs,
three_of_a_kind, full_house and four_of_a_kind. This is, of course, a
simplification of the real poker domain, where more classes exist and it is necessary to
distinguish between e.g. a pair of queens and a pair of kings; but this simplified
version suffices to illustrate the relevant topics and keeps learning times sufficiently
low to allow for reasonably extensive experiments.

Figure 8 illustrates how one example in the poker domain can be represented. We
have created the data sets for this domain using a program that randomly generates
examples for this domain. The advantage of this approach is its flexibility: it is easy
to create multiple training sets of increasing size, as well as an independent test set.

An interesting property of this data set is that some classes, e.g. four_of_a_kind,
are very rare, hence a large data set is needed to learn these classes (assuming the 
data are generated randomly). 
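A generator of this kind can be sketched as follows. This is not the program used for
the experiments, whose details are not given here; it is a hypothetical
reconstruction that assumes hands are drawn uniformly from a standard 52-card deck
and that only the six classes listed above are distinguished (so straights and
flushes are labelled nothing).

```python
import random
from collections import Counter

RANKS = [2, 3, 4, 5, 6, 7, 8, 9, 10, "jack", "queen", "king", "ace"]
SUITS = ["spades", "hearts", "clubs", "diamonds"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]


def classify(hand):
    """Name a five-card hand using only the six classes considered in the text."""
    counts = sorted(Counter(rank for rank, _ in hand).values(), reverse=True)
    if counts[0] == 4:
        return "four_of_a_kind"
    if counts[:2] == [3, 2]:
        return "full_house"
    if counts[0] == 3:
        return "three_of_a_kind"
    if counts[:2] == [2, 2]:
        return "two_pairs"
    if counts[0] == 2:
        return "pair"
    return "nothing"


def generate_example(model_id, rng):
    """Return one randomly generated interpretation in the begin/end model format."""
    hand = rng.sample(DECK, 5)                 # five distinct cards
    lines = [f"begin(model({model_id}))."]
    lines += [f"card({rank},{suit})." for rank, suit in hand]
    lines += [f"{classify(hand)}.", f"end(model({model_id}))."]
    return "\n".join(lines)


example_text = generate_example(1, random.Random(0))
```

Under these assumptions only 624 of the 2,598,960 possible hands are four of a kind
(about 0.024%), which illustrates why a large data set is needed before such a class
is represented at all.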

5.1.3 Mutagenesis Data Set 

The Mutagenesis dataset (Srinivasan et al., 1996) is a classic benchmark in Inductive
Logic Programming. The set that has been used most often in the literature consists
of 188 examples. Each example describes a molecule. Some of these molecules are
mutagenic (i.e., may cause DNA mutations), others are not. The task is to predict
the mutagenicity of a molecule from its description.

    begin(model(e71)).
    player(my,1,-48.804436,-0.16494742,339).
    player(my,2,-34.39789,1.0097091,362).
    player(my,3,-32.628735,-18.981379,304).
    player(my,4,-27.1478,1.3262547,362).
    player(my,5,-31.55078,18.985638,362).
    player(my,6,-41.653893,15.659259,357).
    player(my,7,-48.964966,25.731588,352).
    player(my,8,-18.363993,3.815975,362).
    player(my,9,-22.757153,32.208805,347).
    player(my,10,-12.914384,11.456045,362).
    player(my,11,-10.190831,14.468359,18).
    player(other,1,-4.242554,11.635328,314).
    player(other,2,0.0,0.0,0).
    player(other,3,-13.048958,23.604038,299).
    player(other,4,0.0,0.0,0).
    player(other,5,2.4806643,9.412553,341).
    player(other,6,-9.907758,2.6764495,362).
    player(other,7,0.0,0.0,0).
    player(other,8,0.0,0.0,0).
    player(other,9,-4.2189126,9.296844,339).
    player(other,10,0.4492856,11.43235,158).
    player(other,11,0.0,0.0,0).
    ball(-32.503292,.81057936,362).
    mynumber(5).
    retime(362).
    turn(137.4931640625).
    actiontime(362).
    end(model(e71)).

Figure 7: The Prolog representation of one example in the RoboCup data set. A
fact such as player(other,3,-13.048958,23.604038,299) means that player 3 of
the other team was last seen at position (-13,23.6) at time 299. A position of (0,0)
means that that player has never been observed by the player that has generated this
model. The action performed currently by this player is turn(137.4931640625):
it is turning towards the ball.

    begin(model(4)).
    card(7,spades).
    card(queen,hearts).
    card(9,clubs).
    card(9,spades).
    card(ace,diamonds).
    pair.
    end(model(4)).

Figure 8: An example from the Poker data set.

    begin(model(1)).
    pos.
    atom(d1_1,c,22,-0.117).
    atom(d1_2,c,22,-0.117).
    atom(d1_3,c,22,-0.117).
    atom(d1_4,c,195,-0.087).
    atom(d1_5,c,195,0.013).
    atom(d1_6,c,22,-0.117).
    (...)
    atom(d1_25,o,40,-0.388).
    atom(d1_26,o,40,-0.388).
    bond(d1_1,d1_2,7).
    bond(d1_2,d1_3,7).
    bond(d1_3,d1_4,7).
    bond(d1_4,d1_5,7).
    bond(d1_5,d1_6,7).
    bond(d1_6,d1_1,7).
    bond(d1_1,d1_7,1).
    bond(d1_2,d1_8,1).
    bond(d1_3,d1_9,1).
    (...)
    bond(d1_24,d1_19,1).
    bond(d1_24,d1_25,2).
    bond(d1_24,d1_26,2).
    end(model(1)).

Figure 9: The Prolog representation of one example in the Mutagenesis data set.
The atom facts enumerate the atoms in the molecule. For each atom, its element
(e.g. carbon), type (e.g. carbon can occur in several configurations; each type
corresponds to one specific configuration) and partial charge are given. The bond
facts enumerate all the bonds between the atoms (the last argument is the type of
the bond: single, double, aromatic, etc.). pos denotes that the molecule belongs to
the positive class (i.e. is mutagenic).

The data set is a typical ILP data set in that the example descriptions are highly 
structured, and there is background knowledge about the domain. Several levels 
of background knowledge have been studied in the literature (see again Srinivasan 
et al. (1996)); for our experiments we have always used the simplest background 
knowledge, i.e. only structural information about the molecules (the atoms and 
bonds occurring in them) is available.

Figure 9 shows a part of the description of one molecule.

5.2 Materials and Settings 

All experiments were performed with the two implementations of Tilde we
discussed: Tildeclassic and TildeLDS. These programs are implemented in Prolog
and run under the MasterProlog engine (formerly named ProLog-by-BIM). The
hardware we used is a Sun Ultra-2 at 167 MHz, running the Solaris system (except
when stated otherwise).

Both Tildeclassic and TildeLDS offer the possibility to precompile the data
file. We exploited this feature for all our experiments. For TildeLDS this raises 
the problem that in order to load one example at a time, a different object file has 
to be created for each example (MasterProlog offers no predicates for loading only 
a part of an object file). This can be rather impractical. For this reason several 
examples are usually compiled into one object file; a parameter called granularity 
(G) controls how many examples can be included in one object file. 

Object files are then loaded one by one by TildeLDS, which means that G
examples at a time are loaded into main memory (instead of one). Because of this,
the granularity parameter can affect the efficiency of TildeLDS. This is investigated
in our experiments. 

By default, a value of 10 was used for G. 
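The grouping controlled by the granularity parameter can be sketched as a simple
batching step. The sketch below only illustrates how examples are divided over
object files; the function name is hypothetical, and compiling and loading the
object files is specific to MasterProlog and not shown.

```python
def object_file_batches(examples, granularity=10):
    """Group examples into batches of at most `granularity` examples each."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == granularity:
            yield batch
            batch = []
    if batch:                                  # last, possibly smaller, batch
        yield batch


batches = list(object_file_batches(range(188), granularity=10))
```

With the 188 Mutagenesis examples and the default G = 10, this gives 19 batches, the
last of which holds 8 examples.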



5.3 Experiment 1: Time Complexity 
5.3.1 Aim of the Experiment 

As mentioned before, induction of trees with TildeLDS should in principle have a 
time complexity that is linear in the number of examples. With our first experiment 
we empirically test whether our implementation indeed exhibits this property. We 
also compare it with other approaches where the locality assumption is exploited 
less or not at all. 

We distinguish the following approaches: 

• loading all data at once in main memory without exploiting the locality as- 
sumption (the standard ILP approach) 

• loading all data at once in main memory, exploiting the locality assumption; 
this is what Tildeclassic does

• loading examples one at a time in main memory; this is what TildeLDS does 

To the best of our knowledge all ILP systems that do not learn from interpre- 
tations follow the first approach (with the exception of a few systems that access 
an external database directly instead of loading the data into main memory, e.g. 
Rdt/db (Morik and Brockhausen, 1997); but these systems still do not make a
locality assumption). We can easily simulate this approach with Tildeclassic by
specifying all information about the examples as background knowledge. For the 
background knowledge no locality assumption can be made, since all background 
knowledge is potentially relevant for each example. 

The performance of a Prolog system that works with a large database is im- 
proved significantly if indexes are built for the predicates. On the other hand, 
adding indexes for predicates creates some overhead with respect to the internal 
space that is needed, and a lot of overhead for the compiler. The MasterProlog 
system by default indexes all predicates, but this indexing can be switched off. We 
have performed experiments for the standard ILP approach both with and with- 
out indexing (thus, the first approach in the above list is actually subdivided into 
"indexed" and "not indexed"). 



5.3.2 Methodology 

Since the aim of this experiment is to determine the influence of the number of
examples (and only that) on time complexity, we want to control as much as possible
other factors that might also have an influence. We have seen in Section 4.4 that
these other factors include the number of nodes n, the average number of tests per
node t and the average complexity of performing one test on one single example c.
The last of these, c, depends on both the complexity of the queries themselves and
on the example sizes.

When varying the number of examples for our experiments, we want to keep 
these factors constant. This means that first of all the refinement operator should 
be the same for all the experiments. This is automatically the case if the user 
does not change the refinement operator specification (the rmode facts) between 
consecutive experiments. 

The other factors can be kept constant by ensuring that the same tree is built in 
each experiment, and that the average complexity of the examples does not change. 
In order to achieve this, we adopt the following methodology. We create, from a 
small data set, larger data sets by including each single example several times. By 
ensuring that all the examples occur an equal number of times in the resulting data 
set, the class distribution, average complexity of testing a query on an example etc. 
are all kept constant. In other words, all variation due to the influence of individual 
examples is removed. 

Because the class distribution stays the same, the test that is chosen in each 
node also stays the same. This is necessary to ensure that the same tree is grown, 
but not sufficient: the stopping criterion needs to be adapted as well so that a 
node that cannot be split further for the small data set is not split when using the 
larger data set either. In order to achieve this, the minimal number of examples 
that have to be covered by each leaf (which is a parameter of Tilde) is increased 
proportionally to the size of the data set. 

By following this methodology, the mentioned unwanted influences are filtered 
out of the results. 
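The construction of the enlarged data sets can be sketched as follows. The function
name and the value 2 used for the minimal leaf coverage are hypothetical; the sketch
only shows that both the data set and the stopping parameter are scaled by the same
factor.

```python
def scaled_experiment(examples, min_leaf_cover, n):
    """Replicate every example 2**n times and scale the leaf-size parameter to match."""
    factor = 2 ** n
    replicated = [e for e in examples for _ in range(factor)]
    return replicated, min_leaf_cover * factor


data, min_cover = scaled_experiment(list(range(188)), min_leaf_cover=2, n=9)
```

For the 188 original examples and n = 9 this gives 96256 examples, and the minimal
number of examples per leaf grows by the same factor of 512, so a node that could not
be split on the original data cannot be split on the enlarged data either.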



5.3.3 Materials 

We used the Mutagenesis data set for this experiment. Other materials are as
described in Section 5.2.



5.3.4 Setup of the Experiment 

Four different versions of Tilde are compared: 

• Tildeclassic without locality assumption, without indexing

• Tildeclassic without locality assumption, with indexing

• Tildeclassic with locality assumption

• TildeLDS

The first three "versions" are actually the same version of Tilde as far as the 
implementation of the learning algorithm is concerned, but differ in the way the 
data are represented and in the way the underlying Prolog system handles them. 

Each Tilde version was first run on the original data set, then on data sets
that contain each original example 2^n times, with n ranging from 1 to 9. Table 1
summarizes some properties of the data sets that were obtained in this fashion.

For each run on each data set we have recorded the following: 

• the time needed for the induction process itself (in CPU-seconds) 

• the time needed to compile the data (in CPU-seconds). The different systems
compile the data in different ways (e.g. according to whether indexes need
to be built). As compilation of the data need only be done once, even if
afterwards several runs of the induction system are done, compilation time
may seem less relevant. Still, it is important to see how the compilation scales
up, since it is not really useful to have an induction method that scales linearly
if it needs a preprocessing step that scales super-linearly.

Table 1: Properties of the example sets

    multiplication factor    #examples    #facts     size (MB)
    1                        188          10512      0.25
    2                        376          21024      0.5
    4                        752          42048      1
    8                        1504         84096      2
    16                       3008         168192     4
    32                       6016         336384     8
    64                       12032        672768     16
    128                      24064        1345536    32
    256                      48128        2691072    65
    512                      96256        5382144    130

Table 2: Scaling properties of TildeLDS in terms of the number of examples

    multiplication factor    induction time (CPU seconds)    compilation time (CPU seconds)
    1                        123                             3
    2                        245                             6.3
    4                        496                             12.7
    8                        992                             25
    16                       2026                            50
    32                       3980                            97
    64                       7816                            194
    128                      15794                           391
    256                      32634                           799
    512                      76138                           1619

5.3.5 Discussion of the Results 

Tables 2, 3, 4 and 5 give an overview of the time each Tilde version needed to
induce a tree for each set, as well as the time it took to compile the data into the
correct format. The results are shown graphically in Figure 10. Note that both
the number of examples and time are indicated on a logarithmic scale. Care must
be taken when interpreting these graphs: a straight line does not by itself indicate
a linear relationship between the variables. Indeed, if $\log y = n \cdot \log x$, then
$y = x^n$. This means the slope of the line should be 1 in order to have a linear
relationship, while 2 indicates a quadratic relationship, and so on. In order to make
it easier to recognize a linear relationship (slope 1), the function $y = x$ has been
drawn on the graphs as a reference.
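
The slope reading of a log-log plot can also be checked numerically. The following is a minimal sketch, not part of the original experiments: it fits a least-squares line to log(time) against log(multiplication factor), using the induction times reported in Table 2 (TildeLDS) and Table 5 (Tilde without locality or indexing). The function name `loglog_slope` is an assumption made for this illustration.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Induction times (CPU seconds) as reported in Table 2 (TildeLDS).
factors_lds = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
times_lds = [123, 245, 496, 992, 2026, 3980, 7816, 15794, 32634, 76138]

# Induction times (CPU seconds) as reported in Table 5
# (no locality assumption, no indexing).
factors_plain = [1, 2, 4, 8]
times_plain = [2501, 12385, 51953, 207966]

slope_lds = loglog_slope(factors_lds, times_lds)
slope_plain = loglog_slope(factors_plain, times_plain)
print(f"TildeLDS slope: {slope_lds:.2f}")
print(f"no locality, no indexing slope: {slope_plain:.2f}")
```

A fitted slope close to 1 corresponds to linear scaling and a slope close to 2 to quadratic scaling; on these figures the first fit comes out near 1 and the second near 2.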

Note that only TildeLDS scales up well to large data sets. The other versions
of Tilde had problems loading or compiling the data from a multiplication factor
of 16 or 32 onwards.

The graphs and tables show that induction time is linear in the number of
examples for TildeLDS, for Tildeclassic with locality, and for Tildeclassic without
locality but with indexing. For Tildeclassic without locality or indexing the
induction time increases quadratically with the number of examples. This is not
unexpected, as in this setting the time needed to run a test on one single example
increases with the size of the dataset.
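
The quadratic behaviour follows from a simple cost argument: if each of N examples requires a test whose cost grows with the total number of facts, the total work grows as N times N. The sketch below is an illustrative cost model only; it counts fact inspections and does not run Tilde. The function names and the value of 56 facts per example (10512 / 188 from Table 1, rounded) are assumptions made for the illustration.

```python
def work_with_locality(n_examples, facts_per_example):
    """Each test inspects only the facts of the example it is run on."""
    return n_examples * facts_per_example

def work_without_locality(n_examples, facts_per_example):
    """Without locality or indexing, each test may scan the whole database."""
    total_facts = n_examples * facts_per_example
    return n_examples * total_facts

# Doubling the number of examples doubles the first count
# and quadruples the second.
for n in (188, 376, 752):
    print(n, work_with_locality(n, 56), work_without_locality(n, 56))
```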
Table 3: Scaling properties of Tildeclassic in terms of the number of examples

multiplication   time (CPU seconds)
factor           induction   compilation
1                26.3        6.8
2                42.5        13.7
4                75.4        27.1
8                148.7       54.2
16               296.1       110.1
32               ?*          217.1

* Prolog engine failed to load the data



Table 4: Scaling properties of Tilde without locality assumption, with indexing,
in terms of number of examples

multiplication   time (CPU seconds)
factor           induction   compilation
1                26.1        20.6
2                45.2        293
4                83.9        572
8                176.7       1640
16               ?*          5381
32               ?*          18388

* Prolog engine failed to load the data



Table 5: Scaling properties of Tilde without locality assumption, without indexing,
in terms of number of examples

multiplication   time (CPU seconds)
factor           induction   compilation
1                2501        2.85
2                12385       5.91
4                51953       12.21
8                207966      25.47
16               ?*          52.25
32                           ?**

* Prolog engine failed to load the data
** Prolog compiler failed to compile the data
Figure 10: Scaling properties of TildeLDS in terms of number of examples

With respect to compilation times, we note that all are linear in the size of
the data set, except Tildeclassic without locality and with indexing. This is in
correspondence with the fact that building an index for the predicates in a deductive
database is an expensive operation, super-linear in the size of the database.

Furthermore, the experiments confirm that Tildeclassic with locality scales as
well as TildeLDS with respect to time complexity, but for large data sets runs into
problems because it cannot load all the data.

Observing that without indexing induction time increases quadratically, and 
with indexing compilation time increases quadratically, we conclude that the locality 
assumption is indeed crucial to our linearity results, and that loading only a few 
examples at a time in main memory makes it possible to handle much larger data 
sets. 

5.4 Experiment 2: The Effect of Localization 
5.4.1 Aim of the experiment 

In the previous experiment we studied the effect of the number of examples on 
time complexity, and observed that this effect is different according to whether the 
locality assumption is made. In this experiment we do not just distinguish between 
localized and not localized, but consider gradual changes in localization, and thus 
try to quantify the effect of localization on the induction time. 



5.4.2 Methodology 

We can test the influence of localization on the efficiency of TildeLDS by varying
the granularity parameter G in TildeLDS. G is the number of examples that are
loaded into main memory at the same time. Localization of information is stronger
when G is smaller.
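
The role of G can be pictured as a batch size for reading examples. The following is a minimal sketch under that reading; it is not the TildeLDS implementation, and the names `batches` and `count_covered` are hypothetical. It evaluates a test over a stream of examples while holding at most g of them in memory at once.

```python
from itertools import islice

def batches(examples, g):
    """Yield successive lists of at most g examples from an iterable."""
    if g < 1:
        raise ValueError("granularity g must be at least 1")
    it = iter(examples)
    while True:
        chunk = list(islice(it, g))
        if not chunk:
            return
        yield chunk

def count_covered(examples, test, g):
    """Count examples satisfying `test`, holding at most g examples in memory."""
    covered = 0
    for chunk in batches(examples, g):
        covered += sum(1 for example in chunk if test(example))
    return covered
```

With g = 1 every example is handled in isolation (the strongest localization); a larger g trades memory for fewer, larger loads.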

The effect of G was tested by running TildeLDS successively on the same 
data set, under the same circumstances, but with different values for G. In these 
experiments G ranged from 1 to 200. For each value of G both compilation and 
induction were performed ten times; the reported times are the means of these ten 
runs. 



5.4.3 Materials 

We have used three data sets: a RoboCup data set with 10000 examples, a Poker 
data set containing 3000 examples, and the Mutagenesis data set with a multipli- 
cation factor of 8 (i.e. 1504 examples). The data sets were chosen to contain a 
sufficient number of examples to make it possible to let G vary over a relatively 
broad range, but not more (to limit the experimentation time). 



Other materials are as described in Section 5.2.



5.4.4 Discussion of the Results 

Induction times and compilation times are plotted versus granularity in Figure 11.
It can be seen from these plots that induction time increases approximately linearly
with granularity. For very small granularities, however, the induction time can also
increase. We suspect that this effect can be attributed to an overhead of disk access
(loading many small files, instead of fewer larger files). A similar effect is seen when
we look at the compilation times: these decrease when the granularity increases, but
asymptotically approach a constant. This again suggests an overhead caused by
compiling many small files instead of one large file. The fact that the observed
effect is smallest for Mutagenesis, where individual examples are larger, increases
the plausibility of this explanation.

This experiment clearly shows that the performance of TildeLDS strongly de- 
pends on G, and that a reasonably small value for G is preferable. It thus confirms 
the hypothesis that localization of information is advantageous with respect to time 
complexity. 



5.5 Experiment 3: Practical Scaling Properties 
5.5.1 Aim of the experiment 

With this experiment we want to measure how well TildeLDS scales up in practice, 
without controlling any influences. This means that the tree that is induced is not 
guaranteed to be the same one or have the same size, and that a natural variation 
is allowed with respect to the complexity of the examples as well as the complexity 
of the queries. This experiment is thus meant to mimic the situations that arise in 
practice. 

Since different trees may be grown on different data sets, the quality of these 
trees may differ. We investigate this as well. 



5.5.2 Methodology 

The methodology we follow is to choose some domain and then create data sets with 
different sizes for this domain. TildeLDS is then run on each data set, and for each 
run the induction time is recorded, as well as the quality of the tree (according to 
different criteria, see below). 



5.5.3 Materials 

Data sets from two domains were used: RoboCup and Poker. These domains were 
chosen because large data sets were available for them. For each domain several 
data sets of increasing size were created. 

Whereas induction times have been measured on both data sets, predictive accu- 
racy has been measured only for the Poker data set. This was done using a separate 
test set of 100,000 examples, which was the same for all the hypotheses. 

For the RoboCup data set interpretability of the hypotheses by domain experts 
is the main evaluation criterion (because these theories are used for verification of 
the behavior of agents, see (Jacobs et al., 1998)). 

The RoboCup experiments have been run on a SUN SPARCstation-20 at 100 
MHz; for the Poker experiments a SUN Ultra-2 at 167 MHz was used. 



5.5.4 Discussion of the Results 

Table 6 shows the consumed CPU-times as a function of the number of examples, as
well as the predictive accuracy. These figures are plotted in Figure 12. Note that
the CPU-time graph is again plotted on a double logarithmic scale.

With respect to accuracy, the Poker hypotheses show the expected behavior: 
when more data are available, the hypotheses can predict very rare classes (for 
which no examples occur in smaller data sets), which results in higher accuracy. 

The graphs further show that in the Poker domain, TildeLDS scales up linearly,
even though more accurate (and slightly more complex) theories are found for larger
data sets.

In the RoboCup domain, the induced hypotheses were the same for all runs
except the 10000 examples run. In this single case the hypothesis was more simple
and, according to the domain expert, less informative than for the other runs. This
suggests that in this domain a relatively small set of examples (20000) suffices to
learn from.

Figure 11: The effect of granularity on induction time (full range, and zoomed
on interval [0-30]) and compilation time

Table 6: Consumed CPU-time and accuracy of hypotheses produced by TildeLDS
in the Poker domain

#examples   compilation     induction       accuracy
            (CPU-seconds)   (CPU-seconds)
300         1.36            288             0.98822
1000        4.20            1021            0.99844
3000        12.36           3231            0.99844
10000       41.94           12325           0.99976
30000       125.47          33394           0.99976
100000      402.63          121266          1.0

Figure 12: Consumed CPU-time and accuracy of hypotheses produced by
TildeLDS in the Poker domain, plotted against the number of examples
(Panels: induction and compilation time; accuracy. Horizontal axis: # examples.)

Table 7: Consumed CPU-time of hypotheses produced by TildeLDS in the
RoboCup domain

#examples   compilation   induction
10000       274           1448 ± 44
20000       522           4429 ± 83
30000       862           7678 ± 154
40000       1120          9285 ± 552
50000       1302          6607 ± 704
60000       1793          13665 ± 441
70000       1964          29113 ± 304
80000       2373          28504 ± 657
88594       2615          50353 ± 3063

Figure 13: Consumed CPU-time for TildeLDS in the RoboCup domain, plotted
against the number of examples
(Horizontal axis: # examples, 10000 to 90000.)

It is harder to see how TildeLDS scales up for the RoboCup data. Since the
same tree is returned in all runs except the 10000 examples run, one would expect
the induction times to grow linearly. However, the observed curve does not seem
linear, although it does not show a clear tendency to be super-linear either. Because
large variations in induction time were observed, we performed these runs 10 times;
the estimated mean induction times are reported together with their standard errors.
The standard errors alone cannot explain the observed deviations, nor can variations
in example complexity (all examples are of equal complexity in this domain).
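
For reference, a mean and its standard error over repeated runs can be computed as in the sketch below. This is a generic illustration assuming the usual definition (sample standard deviation divided by the square root of the number of runs); the timings in `runs` are hypothetical values, not measurements from this paper, and the function name is an assumption.

```python
import math
import statistics

def mean_and_standard_error(samples):
    """Return (mean, standard error of the mean) for at least two samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr

# Hypothetical induction times (CPU seconds) for ten repeated runs.
runs = [1400, 1510, 1385, 1620, 1450, 1395, 1480, 1530, 1360, 1350]
mean, stderr = mean_and_standard_error(runs)
print(f"{mean:.0f} ± {stderr:.0f}")
```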

A possible explanation is the fact that the Prolog engine performs a number
of tasks that are not controlled by Tilde, such as garbage collection. In specific
cases, the Prolog engine may perform many garbage collections before expanding
its memory space (this happens when the amount of free memory after garbage
collection is always just above some threshold), and the time needed for these garbage
collections is included in the measured CPU-times. The MasterProlog engine is
known to sometimes exhibit such behavior.

In order to sort this out, TildeLDS would have to be reimplemented in a lower-
level language than Prolog, where one has full control over all computations that
occur. Such a reimplementation is planned.

Due to the domain-dependent character of these complexity results, one should 
be careful when generalizing them; it seems safe to conclude, however, that the 
linear scaling property has at least a reasonable chance of occurring in practice. 

6 Related Work 

Our work is closely related to efforts in the propositional learning field to increase
the capability of machine learning systems to handle large databases. It has been
influenced more specifically by a tutorial on data mining by Usama Fayyad, in which
the work of Mehta and others was mentioned (Mehta et al., 1996; Shafer et al., 1996).
They were the first to propose the level-wise tree building algorithm we adopted,
and to implement it in the SLIQ (Mehta et al., 1996) and SPRINT (Shafer et al.,
1996) systems. The main difference with our approach is that SLIQ and SPRINT
learn from one single relation, while TildeLDS can learn from multiple relations.

Related work inside ILP includes the Rdt/db system (Morik and Brockhausen,
1997), which presents the first approach to coupling an ILP system with a relational
database management system (RDBMS). Being an ILP system, Rdt/db also learns
from multiple relations. The approach followed is that a logical test that is to
be performed is converted into an SQL query and sent to an external relational
database management system. This approach is essentially different from ours, in
that it exploits as much as possible the power of the RDBMS to efficiently evaluate
queries. Also, there is no need for preprocessing the data. Disadvantages are that
for each query an external database is accessed, which is relatively slow, and that it
is less flexible with respect to background knowledge. Furthermore, to obtain good
performance, complex modifications to the RDBMS (tailoring it towards
data mining) are needed. Preliminary experiments with coupling Claudien and
Tilde to an Oracle RDBMS confirmed these claims and caused us to abandon such
an approach.
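
To make the idea of evaluating a logical test as an SQL query concrete, the sketch below runs one such query against an in-memory SQLite database. It is an illustration only and does not reproduce the translation scheme of Rdt/db: the schema (`molecule`, `atom`), the sample rows and the function `covered_examples` are all hypothetical.

```python
import sqlite3

# Hypothetical two-relation database: one row per example (molecule)
# and one row per atom belonging to an example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE molecule(mol_id TEXT PRIMARY KEY, label TEXT);
CREATE TABLE atom(mol_id TEXT, atom_id TEXT, element TEXT);
INSERT INTO molecule VALUES ('m1', 'pos'), ('m2', 'neg');
INSERT INTO atom VALUES ('m1', 'a1', 'c'), ('m1', 'a2', 'cl'), ('m2', 'a3', 'c');
""")

def covered_examples(conn, element):
    """Ids of examples for which the test 'atom(M, A, element)' succeeds."""
    rows = conn.execute(
        "SELECT DISTINCT m.mol_id FROM molecule m "
        "JOIN atom a ON a.mol_id = m.mol_id "
        "WHERE a.element = ? ORDER BY m.mol_id",
        (element,),
    )
    return [row[0] for row in rows]

print(covered_examples(conn, "cl"))
```

Each candidate test costs one round trip to the database in this style, which is the per-query overhead mentioned above.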

We also mention the KEPLER system (Wrobel et al., 1996), a data mining tool
that provides a framework for applying a broad range of data mining systems to 
data sets; this includes ILP systems. KEPLER was deliberately designed to be very 
open, and systems using the learning from interpretations setting can be plugged 
into it as easily as other systems. 
At this moment few systems use the learning from interpretations setting (De
Raedt and Van Laer, 1995; De Raedt and Dehaspe, 1997; Dehaspe and De Raedt,
1997). Of these the research described in (Dehaspe and De Raedt, 1997) (the
Warmr system: finding association rules over multiple relations; see also Dehaspe
and Toivonen's contribution in this issue) is most closely related to the work
described in this paper, in the sense that there, too, an effort was made to adapt
the system for large databases. The focus of that text is not on the advantages of
learning from interpretations in general, however, but on the power of first order
association rules.

More loosely related work inside ILP would include all efforts to make ILP
systems more efficient. Since most of this work concerns ILP systems that work in the
classical ILP setting, the ways in which this is done usually differ substantially from
what we describe in this paper. For instance, the well-known ILP system Progol
(Muggleton, 1995) has recently been extended with caching and other efficiency
improvements (Cussens, 1997). Other directions are the use of sampling techniques
and stochastic methods, such as proposed by, e.g., Srinivasan (1999) and Sebag
(1998).

Finally, the Tilde system is related to other systems that induce first order
decision trees, such as the Struct system (Watanabe and Rendell, 1991) (which
uses a less explicitly logic-based approach) and the regression tree learner SRT
(Kramer, 1996).

7 Conclusions 

We have argued and demonstrated empirically that the use of ILP is not limited 
to small databases, as is often assumed. Mining databases of a hundred megabytes 
was shown to be feasible, and this does not seem to be a limit. 

The positive results that have been obtained are due mainly to the use of the 
learning from interpretations setting, which is more scalable than the classical ILP 
setting and makes the link with propositional learning more clear. This means that 
a lot of results obtained for propositional learning can be extrapolated to learning 
from interpretations. We have discussed a number of such upgrades, using the 
TildeLDS system as an illustration. The possibility to upgrade the work by Mehta
et al. (1996) has turned out to be crucial for handling large data sets. It is not 
clear how the same technique could be incorporated in a system using the classical 
ILP setting. 

Although we obtained specific results only for a specific kind of data mining (in- 
duction of decision trees), the results are generalizable not only to other approaches 
within the classification context (e.g. rule based approaches) but also to other in- 
ductive tasks within the learning from interpretations setting, such as clustering, 
regression and induction of association rules. 